The typical goals in scientific investigations are to identify and quantify causal relationships between variables. Causal inference refers to the design and analysis of data for uncovering causal relationships between treatment/intervention variables and outcome variables.
We care about causal inference because a large proportion of real-life questions of interest are questions of causality, not correlation.
Causality has been of concern since the dawn of civilization.
From: Mesopotamian pharmacopoeias, accessed September 26, 2014.
Rigorous statistical frameworks for causality that can be applied to many types of data have been established for only a century.
Designed experiments are the gold standard for causal inference: the controls and the physical act of randomization guarantee, in expectation, no bias due to confounding, and they enable methods of statistical analysis that require no parametric assumptions (Imbens & Rubin, 2015).
Observational studies can yield valid causal inferences if they are designed prior to analysis, so as to approximate a randomized experiment.
Machine learning algorithms can enable powerful causal inferences in Big Data settings with many covariates and treatment effect heterogeneity.
The Rubin Causal Model (RCM) is a rigorous statistical framework for drawing causal inferences from many types of studies.
Applications of the RCM involve three major steps.
☝ There exist other frameworks and models for causal inference. One prominent framework has been developed by Judea Pearl, and is described in Pearl (2009). Due to the focus of the presentations and demonstrations in this boot camp, we will just consider the RCM in this presentation. To learn more about other causal inference frameworks, be sure to watch the recordings of Purdue's Department of Statistics 2021 Distinguished Theme Seminar Series.
Before designing/analyzing an experiment/observational study, it is important to clearly define the Science of the problem.
Science: Define experimental units, covariates, treatments, potential outcomes, estimands.
Experimental unit: physical object(s) at a particular point in time.
An experimental unit assigned a treatment at a particular point in time could have been exposed to an alternative treatment at that same point in time.
Each experimental unit and treatment pair corresponds to a potential outcome.
Unit | Control Potential Outcome | Treatment Potential Outcome |
---|---|---|
$i$ | $Y_i(0)$ | $Y_i(1)$ |
A unit-level causal effect is a comparison of potential outcomes for the same unit at the same point in time post-treatment.
One standard unit-level causal effect is simply $Y_i(1) - Y_i(0)$.
☝ A causal effect is not defined in terms of comparisons of outcomes at different times, as in before-after comparisons.
Unit | Control (0) or Treatment (1)? | $Y_i(0)$ | $Y_i(1)$ | $Y_i(1) - Y_i(0)$ |
---|---|---|---|---|
1 | 1 | ? | ✔ | ? |
For an experimental unit, at most one of the potential outcomes can be realized and thus observed.
To learn about causal effects, we need multiple experimental units, some exposed to treatment and others to control.
Several subtle issues must be considered in the presence of multiple units, and with an assignment mechanism.
The Stable Unit-Treatment Value Assumption (SUTVA, Imbens & Rubin, 2015: p. 10) consists of two conditions on the Science that yield unambiguous potential outcomes and causal effects.
Experiments and observational studies are typically designed to ensure that SUTVA holds.
Unit | $X_i$ | $Y_i(0)$ | $Y_i(1)$ |
---|---|---|---|
1 | $X_1$ | $Y_1(0)$ | $Y_1(1)$ |
2 | $X_2$ | $Y_2(0)$ | $Y_2(1)$ |
$\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
$N$ | $X_N$ | $Y_N(0)$ | $Y_N(1)$ |
Finite-population causal estimand: Comparison of potential outcomes for one set of experimental units.
$$ \left \{ Y_i(1) : i = 1, \ldots, N \right \} \ \mathrm{vs.} \ \left \{ Y_i(0) : i = 1, \ldots, N \right \} $$

The choice of estimand is typically governed by the question of interest for the Science.
One common estimand is the average treatment effect
$$ \tau = \bar{Y}(1) - \bar{Y}(0) = \frac{1}{N} \sum_{i=1}^N Y_i(1) - \frac{1}{N} \sum_{i=1}^N Y_i(0), $$

but this is not the only possible causal estimand of interest.
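As a small numerical sketch (the potential outcomes below are fabricated for illustration), the finite-population average treatment effect is just the difference of the two column averages of the complete Science table:

```r
# Hypothetical complete table of potential outcomes for N = 4 units;
# in practice at most one of Y0[i], Y1[i] is ever observed.
Y0 <- c(5, 3, 8, 2)
Y1 <- c(6, 3, 7, 5)

tau <- mean(Y1) - mean(Y0)  # finite-population average treatment effect
tau                         # 0.75
```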
Unit | $X_i$ | Control (0) or Treatment (1)? | $Y_i(0)$ | $Y_i(1)$ | $Y_i(1) - Y_i(0)$ |
---|---|---|---|---|---|
1 | $X_1$ | 1 | ? | ✔ | ? |
2 | $X_2$ | 0 | ✔ | ? | ? |
$\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
$N$ | $X_N$ | 1 | ? | ✔ | ? |
We never observe all potential outcomes or all unit-level effects.
Causal inference is a missing data problem, and a key role is played by the assignment mechanism.
Assignment mechanism: Description of how treatments are assigned to the experimental units (and the corresponding outcomes are observed).
Observed outcomes can help us infer causal effects only when we account for how units came to receive their treatments.
☝ Potential outcomes and causal effects are well-defined regardless of the assignment mechanism.
Subject | $Y_i(0)$ | $Y_i(1)$ | $Y_i(1) - Y_i(0)$ |
---|---|---|---|
1 | 1 | 7 | 6 |
2 | 6 | 5 | -1 |
3 | 1 | 5 | 4 |
4 | 8 | 7 | -1 |
5 | 2 | 3 | 1 |
6 | 6 | 3 | -3 |
Average | 4 | 5 | 1 |
On average, surgery increases survival time by 1 year.
$$ \bar{Y}(1) - \bar{Y}(0) = \frac{1}{6} \sum_{i=1}^6 Y_i(1) - \frac{1}{6} \sum_{i=1}^6 Y_i(0) = 1 $$

Suppose that the doctor running this clinical trial was perfect: she assigns to each subject the procedure that gives that subject the maximum survival time.
In this case, the observed data would be:
Subject | Drug (0) or Surgery (1)? | $Y_i(0)$ | $Y_i(1)$ | $Y_i(1) - Y_i(0)$ |
---|---|---|---|---|
1 | 1 | ? | 7 | ? |
2 | 0 | 6 | ? | ? |
3 | 1 | ? | 5 | ? |
4 | 0 | 8 | ? | ? |
5 | 1 | ? | 3 | ? |
6 | 0 | 6 | ? | ? |
At first glance, the observed outcomes appear to suggest that surgery is worse than drug on average.
$$ \bar{y}^{\mathrm{obs}}(1) - \bar{y}^{\mathrm{obs}}(0) = -5/3 $$

In general, $\bar{y}^{\mathrm{obs}}(1) - \bar{y}^{\mathrm{obs}}(0)$ would not be an appropriate estimator of $\bar{Y}(1) - \bar{Y}(0)$ if the assignment mechanism depends on the potential outcomes.
In this case, this estimator is ignoring the confounding from the Perfect Doctor's (lurking) judgments about assignment of surgery and drug to the subjects, and is accordingly biased.
An estimator in this setting should take into account the relationships between the potential outcomes that governed the Perfect Doctor's assignment mechanism.
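The Perfect Doctor's bias can be reproduced in a few lines of R, using the survival times from the complete potential-outcomes table for this example:

```r
# Perfect Doctor example: survival times in years
Y0 <- c(1, 6, 1, 8, 2, 6)  # drug potential outcomes
Y1 <- c(7, 5, 5, 7, 3, 3)  # surgery potential outcomes
W  <- as.numeric(Y1 > Y0)  # Perfect Doctor assigns whichever treatment is better

mean(Y1) - mean(Y0)                  # true average treatment effect: 1
mean(Y1[W == 1]) - mean(Y0[W == 0])  # naive observed contrast: -5/3
```

The naive contrast has the wrong sign precisely because the assignment depends on the potential outcomes.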
Subject | Drug (0) or Surgery (1)? | $Y_i(0)$ | $Y_i(1)$ | $Y_i(1) - Y_i(0)$ |
---|---|---|---|---|
1 | 1 | $\leq 7$ | 7 | $\geq 0$ |
2 | 0 | 6 | $\leq 6$ | $\leq 0$ |
3 | 1 | $\leq 5$ | 5 | $\geq 0$ |
4 | 0 | 8 | $\leq 8$ | $\leq 0$ |
5 | 1 | $\leq 3$ | 3 | $\geq 0$ |
6 | 0 | 6 | $\leq 6$ | $\leq 0$ |
Causal inference requires a consideration of why the units received one treatment rather than another.
One may not be able to obtain valid causal inferences from observed outcomes if one ignores the assignment mechanism.
Knowledge of the assignment mechanism can also be sufficient for certain causal inferences (Imbens & Rubin, 2015).
We proceed to introduce notation and assumptions for assignment mechanisms.
Suppose we have $N$ units, indexed by $i$.
Covariates: Pre-treatment/background characteristics of the experimental units, or their data that are unaffected by treatment assignment.
$X_i$: $K$-component vector of covariates for unit $i$. Each $X_i \in \mathbb{R}^K$.
$\mathbf{X} = \begin{pmatrix} X_1' \\ X_2' \\ \vdots \\ X_N' \end{pmatrix}$: $N \times K$ matrix of covariates for all units.
$Y_i(0), Y_i(1)$: control and treatment potential outcomes for unit $i$.
$\mathbf{Y}(0) = \begin{pmatrix} Y_1(0) \\ Y_2(0) \\ \vdots \\ Y_N(0) \end{pmatrix}$: $N$-component vector of control potential outcomes for all units.
$\mathbf{Y}(1) = \begin{pmatrix} Y_1(1) \\ Y_2(1) \\ \vdots \\ Y_N(1) \end{pmatrix}$: $N$-component vector of treatment potential outcomes for all units.
☝ We assume SUTVA holds.
$W_i$: treatment assignment indicator for unit $i$.
$$ W_i = \begin{cases} 1 & \mbox{if unit} \ i \ \mbox{is assigned treatment,} \\ 0 & \mbox{if unit} \ i \ \mbox{is assigned control}. \end{cases} $$

$\mathbf{W} = \begin{pmatrix} W_1 \\ W_2 \\ \vdots \\ W_N \end{pmatrix}$: $N$-component vector of treatment assignment indicators for all units.
We shall typically denote the number of units assigned treatment and control as, respectively, $$ N_1 = \sum_{i=1}^N W_i, $$ $$ N_0 = \sum_{i=1}^N \left ( 1-W_i \right ). $$
Observed outcomes are functions of potential outcomes and treatment assignment indicators.
$$ y_i^{\mathrm{obs}} = W_iY_i(1) + (1-W_i)Y_i(0) $$

$$ y_i^{\mathrm{mis}} = (1-W_i)Y_i(1) + W_iY_i(0) $$

☝ Potential outcomes and causal effects are well-defined regardless of treatment assignment indicators.
⚠ Do not confuse potential outcomes with observed outcomes!
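A minimal sketch of the mapping from potential outcomes to observed and missing outcomes (toy values, chosen only for illustration):

```r
W  <- c(1, 0, 1)  # treatment assignment indicators
Y0 <- c(4, 6, 2)  # control potential outcomes
Y1 <- c(5, 7, 9)  # treatment potential outcomes

y_obs <- W * Y1 + (1 - W) * Y0  # c(5, 6, 9): what we actually see
y_mis <- (1 - W) * Y1 + W * Y0  # c(4, 7, 2): what we never see
```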
The assignment mechanism is a description of the probability of any vector of assignments for the entire population.
Assignment mechanism: $\mathrm{Pr}\{\mathbf{W} \mid \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1) \}$.
☝ $\displaystyle \sum_{\mathbf{w} \in \{0,1\}^N} \mathrm{Pr}\{\mathbf{W} = \mathbf{w} \mid \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1) \} = 1$ (Imbens & Rubin, 2015: p. 34).
Unit-level assignment probability for unit $i$: $p_i(\mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1)) = \displaystyle \sum_{\mathbf{w}: w_i = 1} \mathrm{Pr}\{\mathbf{W} = \mathbf{w} \mid \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1) \}$.
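For example, in a completely randomized design with $N_1$ of $N$ units assigned treatment, every assignment vector with $\sum_i w_i = N_1$ has probability $1/\binom{N}{N_1}$, so the unit-level assignment probability reduces to $N_1/N$. A quick check in R:

```r
N  <- 6  # total units
N1 <- 3  # units assigned treatment

# Number of assignment vectors with w_i = 1, over the total number of vectors:
p_i <- choose(N - 1, N1 - 1) / choose(N, N1)
p_i  # 0.5, i.e., N1 / N
```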
Individualistic assignment mechanism: There exists a function $q: \mathbb{R}^{K+2} \rightarrow (0,1)$ such that for all subjects $i$, $$ p_i(\mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1)) = q(X_i, Y_i(0), Y_i(1)) $$ and $$\mathrm{Pr} \{ \mathbf{W} \mid \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1) \} = c \displaystyle \prod_{i=1}^N q(X_i, Y_i(0), Y_i(1))^{W_i}\{1-q(X_i,Y_i(0),Y_i(1))\}^{1-W_i} $$ for some set of $(\mathbf{W}, \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1))$, where $c$ is the normalization constant for the probability mass function of the treatment assignment mechanism.
💡 If an assignment mechanism is not individualistic, then some subjects' treatment assignments would depend on the covariates and potential outcomes of other subjects, which would complicate the design and analysis of experiments or observational studies.
Probabilistic assignment mechanism: For all subjects $i$, $0 < p_i(\mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1)) < 1$.
💡 A probabilistic assignment mechanism permits the consideration of all subjects for the design and analysis of an experiment or observational study, and reduces the risk of extrapolation biases when estimating treatment effects.
Unconfounded assignment mechanism: For any $\mathbf{w} \in \{0, 1\}^N$ and potential outcome vectors $\mathbf{Y}(0), \mathbf{Y}'(0), \mathbf{Y}(1), \mathbf{Y}'(1) \in \mathbb{R}^N$, $$ \mathrm{Pr} \{ \mathbf{W} = \mathbf{w} \mid \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1) \} = \mathrm{Pr} \{ \mathbf{W} = \mathbf{w} \mid \mathbf{X}, \mathbf{Y}'(0), \mathbf{Y}'(1) \}. $$
💡 No lurking confounders! The observed covariates contain all the information governing treatment assignment, and no additional variables associated with the outcomes are related to treatment assignment.
Regular assignment mechanism/Strongly ignorable treatment assignment: Combination of the individualistic, probabilistic, and unconfounded assignment assumptions.
💡 A regular assignment mechanism justifies designing an observational study so as to compare treated and control subjects with the same covariates for inferring causal effects.
If a treated and control subject have the same covariates, then their treatment assignments have effectively been performed according to some random mechanism.
Comparing treated and control subjects with the same covariates should thus yield unbiased inferences for the treatment effects in the designed observational study.
Propensity score for a regular assignment mechanism: Unit-level assignment probability, typically denoted by the function $e: \mathbb{R}^K \rightarrow (0,1)$.
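In practice the propensity score is unknown and must be estimated from data. A minimal sketch with logistic regression on simulated data (all variables below are made up for illustration; any flexible classifier could stand in for `glm`):

```r
set.seed(1)
N <- 500
X <- rnorm(N)                       # a single covariate
W <- rbinom(N, 1, plogis(0.5 * X))  # true propensity score: e(x) = plogis(0.5 x)

fit   <- glm(W ~ X, family = binomial)  # logistic-regression propensity model
e_hat <- fitted(fit)                    # estimated propensity scores
range(e_hat)                            # strictly inside (0, 1), as required
```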
Experimental Unit $i$ | $\mathbf{X}_i$ | $W_i$ | $Y_{i}(0)$ | $Y_{i}(1)$ |
---|---|---|---|---|
$1$ | $\mathbf{X}_1$ | $W_1$ | $Y_{1}(0)$ | $Y_{1}(1)$ |
$2$ | $\mathbf{X}_2$ | $W_2$ | $Y_{2}(0)$ | $Y_{2}(1)$ |
$\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
$N$ | $\mathbf{X}_N$ | $W_N$ | $Y_{N}(0)$ | $Y_{N}(1)$ |
Machine learning is popular for deriving causal inferences from experiments and (especially) observational studies.
Machine learning enables the powerful incorporation of covariates for causal inference in the presence of complicated treatment effect heterogeneity.
One standard approach for deriving causal inferences via machine learning corresponds to specifying a model on the observed outcomes, with the predictors consisting of the treatment indicators and covariates.
In certain cases, inferences on the unknown parameters in the machine learning algorithm can be obtained, but without additional assumptions such inferences are technically correlational, not causal.
Consideration of the Rubin Causal Model helps to illuminate valid applications of machine learning algorithms for causal inferences.
💡 The RCM enables one to specify a consistent definition of a causal estimand, i.e., one that does not change depending on the method of analysis for the observed data. Accordingly, causal inferences obtained from different analytical approaches can be directly compared. Also, one does not have to specify a causal estimand in terms of parameters in a model or machine learning algorithm.
Experimental Units
$445$ men in 1976.
Covariates
Age, years of education, ethnicity, etc.
Treatment Factor
Activity, with two levels: job training (1) or nothing (0).
Potential Outcome
Annual earnings in 1978 (in dollars).
Assignment Mechanism
Completely randomized design (CRD) with $185$ treated units ($N_1 = 185$, $N_0 = 260$).
Question of Interest
What is the causal effect of the job training program, accounting for the subjects' covariates?
⚠ For the purpose of today's introductory presentation we give a simple, illustrative case study here.
Lalonde_data = read.csv("Lalonde_data.csv", header=T)
Lalonde_data = Lalonde_data[,-c(1,14)]
Lalonde_data = Lalonde_data[,c(3:12,2,1)]
head(Lalonde_data)
MARR | NODEGREE | BLACK | HISPANIC | EDUC | AGE | RE74 | RE75 | U74 | U75 | TREAT | RE78 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <int> | <int> | <int> | <int> | <int> | <dbl> | <dbl> | <int> | <int> | <int> | <dbl> | |
1 | 0 | 1 | 1 | 0 | 10 | 23 | 0 | 0 | 1 | 1 | 0 | 0.0 |
2 | 0 | 0 | 0 | 0 | 12 | 26 | 0 | 0 | 1 | 1 | 0 | 12383.7 |
3 | 0 | 1 | 1 | 0 | 9 | 22 | 0 | 0 | 1 | 1 | 0 | 0.0 |
4 | 0 | 1 | 1 | 0 | 9 | 18 | 0 | 0 | 1 | 1 | 0 | 10740.1 |
5 | 0 | 1 | 1 | 0 | 11 | 45 | 0 | 0 | 1 | 1 | 0 | 11796.5 |
6 | 0 | 1 | 1 | 0 | 9 | 18 | 0 | 0 | 1 | 1 | 0 | 9227.1 |
Unit | $\mathbf{X}_i$ | $W_i$ | $Y_i(0)$ | $Y_i(1)$ |
---|---|---|---|---|
1 | $\mathbf{X}_1$ | 0 | ✔ | ? |
2 | $\mathbf{X}_2$ | 1 | ? | ✔ |
$\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
N | $\mathbf{X}_N$ | 1 | ? | ✔ |
Super-population perspective (Imbens & Rubin, 2015: p. 113 - 114)
Experimental units are drawn from an infinite super-population of units.
Models are specified for the conditional mean of potential outcomes in the super-population.
The typical causal estimand is the average treatment effect, which can be formulated either in terms of a parameter in the machine learning algorithm (if possible) or by means of average predictive comparisons (Gelman and Pardoe, 2007).
The validity of the models, in terms of whether they accurately describe the conditional mean, is immaterial for the large-sample unbiasedness of the estimator of the average treatment effect.
All results (bias, variance, etc.) are asymptotic (large sample) results.
Finite-population perspective
No assumptions for how the experimental units came to be enrolled into the study are necessary.
Machine learning algorithms are specified for the conditional distributions so as to impute missing potential outcomes and derive posterior distributions for causal estimands of interest.
Causal estimands are defined as comparisons of potential outcomes for the finite population of the experimental units in the study.
One is not limited to considering just average treatment effects as the causal estimands.
Causal estimands are separated from the potential outcome models induced by machine learning algorithms (so that the Science is separated from Learning).
The validity of the machine learning algorithm could be important for the consistency of the causal effect estimators, and should be diagnosed before drawing final conclusions.
The physical act of randomization in completely randomized designs can yield robustness to model misspecification (Garcia-Horton, 2015).
Certain machine learning algorithms can induce a type of probability model for the potential outcomes.
The set of potential outcomes $(Y_i(0), Y_i(1))$ is equivalent to the set of observed and missing potential outcomes $(y_i^{\mathrm{obs}}, y_i^{\mathrm{mis}})$, where
$$ y_i^{\mathrm{obs}} = W_iY_i(1) + (1-W_i)Y_i(0), $$

$$ y_i^{\mathrm{mis}} = (1-W_i)Y_i(1) + W_iY_i(0). $$

If a distribution $p \left (y_i^{\mathrm{mis}} \mid \mathbf{X}, \mathbf{y}^{\mathrm{obs}}, \mathbf{W} \right )$ can be specified based on the machine learning algorithm, then each $y_i^{\mathrm{mis}}$ can be imputed, and the value of the causal estimand under the imputation can be calculated.
The imputation is not perfect, as the predictions obtained from a machine learning algorithm are not perfect.
Multiple imputations of the missing potential outcomes can enable one to perform causal inferences that reflect uncertainty due to the assignment mechanism (i.e., due to the missingness in potential outcomes).
A key assumption for the valid application of multiple imputation is that the assignment mechanism is unconfounded (in addition to probabilistic and individualistic).
Under the finite-population perspective, the parameters $\boldsymbol{\theta}$ of a machine learning algorithm are effectively nuisance parameters, and should be integrated out.
$$ p \left (y_i^{\mathrm{mis}} \mid \mathbf{X}, \mathbf{y}^{\mathrm{obs}}, \mathbf{W} \right ) = \int p \left (y_i^{\mathrm{mis}} \mid \boldsymbol{\theta}, \mathbf{X}, \mathbf{y}^{\mathrm{obs}}, \mathbf{W} \right ) p \left ( \boldsymbol{\theta} \mid \mathbf{X}, \mathbf{y}^{\mathrm{obs}}, \mathbf{W} \right ) d\boldsymbol{\theta} $$

Ideally, the integration over the nuisance model parameters should be performed under the Bayesian paradigm, so as to correctly propagate the uncertainties associated with the parameters.
A complication that could potentially arise under the Bayesian paradigm is deriving $p \left ( \boldsymbol{\theta} \mid \mathbf{X}, \mathbf{y}^{\mathrm{obs}}, \mathbf{W} \right )$.
If draws can be obtained from $p \left ( \boldsymbol{\theta} \mid \mathbf{X}, \mathbf{y}^{\mathrm{obs}}, \mathbf{W} \right )$, then draws can immediately be obtained from $p \left (y_i^{\mathrm{mis}} \mid \mathbf{X}, \mathbf{y}^{\mathrm{obs}}, \mathbf{W} \right )$ by virtue of the integral above.
☝ The $\int$ operation corresponds to both "summa" and "simulate"!
In this manner, implementing the multiple imputation approach under the Bayesian paradigm is conceptually straightforward.
The bootstrap (Efron, 1979) can be used to approximate the Bayesian approach for multiple imputation, i.e., to approximate the integration of unknown (nuisance) parameters.
In particular, we create a bootstrap distribution to approximate $p \left ( \boldsymbol{\theta} \mid \mathbf{X}, \mathbf{y}^{\mathrm{obs}}, \mathbf{W} \right )$ (Little and Rubin, 2002: p. 216 - 217).
For bootstrap sample $m = 1, \ldots, M$:
Draw $\tilde{\boldsymbol{\theta}}^{(m)} \sim \mathcal{B} \left ( \boldsymbol{\theta} \mid \mathbf{X}, \mathbf{y}^{\mathrm{obs}}, \mathbf{W} \right )$.
Draw $\mathbf{y}^{\mathrm{mis}, (m)} \sim p \left ( \mathbf{y}^{\mathrm{mis}} \mid \mathbf{X}, \mathbf{y}^{\mathrm{obs}}, \mathbf{W}, \tilde{\boldsymbol{\theta}}^{(m)} \right )$.
Calculate the causal estimand of interest based on the imputed dataset.
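The three steps above can be sketched in a few lines, with a simple linear model standing in for the machine learning algorithm and simulated data (the model choice and the data-generating process are illustrative assumptions only; a proper imputation would also add residual noise to the predictions):

```r
set.seed(1)
N <- 200
X <- rnorm(N)
W <- rbinom(N, 1, 0.5)
y_obs <- 2 + X + 3 * W + rnorm(N)  # simulated observed outcomes; true effect is 3

M <- 100
tau_draws <- numeric(M)
for (m in 1:M) {
  # Step 1: bootstrap resample within arms to approximate the posterior of theta
  bt <- sample(which(W == 1), replace = TRUE)
  bc <- sample(which(W == 0), replace = TRUE)
  fit_t <- lm(y ~ x, data = data.frame(y = y_obs[bt], x = X[bt]))
  fit_c <- lm(y ~ x, data = data.frame(y = y_obs[bc], x = X[bc]))
  # Step 2: impute each unit's missing potential outcome
  y1 <- ifelse(W == 1, y_obs, predict(fit_t, data.frame(x = X)))
  y0 <- ifelse(W == 0, y_obs, predict(fit_c, data.frame(x = X)))
  # Step 3: compute the causal estimand on the completed dataset
  tau_draws[m] <- mean(y1 - y0)
}
mean(tau_draws)  # close to the true effect of 3
```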
We use the randomForest package in R and the nonparametric bootstrap to infer the treatment effect in the job training program.
⚠ In the interest of time, and for the purpose of giving an illustrative case study, we are skipping several important steps associated with analyzing data via machine learning algorithms.
More detailed treatments of causal inference via random forests are provided by Lu et al. (2018) and Künzel et al. (2019).
#install.packages("randomForest")
library(randomForest)
set.seed(1)
number_of_bootstrap_draws = 10^3
treatment_effect_estimates = rep(NA, number_of_bootstrap_draws)
Lalonde_data_treated = Lalonde_data[Lalonde_data$TREAT==1,]
Lalonde_data_control = Lalonde_data[Lalonde_data$TREAT==0,]
Lalonde_randomForest_treated = randomForest(x=Lalonde_data_treated[,1:10],
y=Lalonde_data_treated[,12])
Lalonde_randomForest_control = randomForest(x=Lalonde_data_control[,1:10],
y=Lalonde_data_control[,12])
treatment_effect_estimate = mean(c((Lalonde_data_treated[,12] -
predict(Lalonde_randomForest_control, Lalonde_data_treated[,1:10])),
(predict(Lalonde_randomForest_treated, Lalonde_data_control[,1:10]) -
Lalonde_data_control[,12])))
print(paste("The treatment effect estimate obtained from random forests is:", treatment_effect_estimate))
progress_bar = txtProgressBar(min=1, max=number_of_bootstrap_draws, style = 3)
for(i in 1:number_of_bootstrap_draws)
{
Lalonde_data_new_treated = Lalonde_data_treated[sample((1:nrow(Lalonde_data_treated)), replace=TRUE),]
Lalonde_randomForest_new_treated = randomForest(x=Lalonde_data_new_treated[,1:10],
y=Lalonde_data_new_treated[,12])
Lalonde_data_new_control = Lalonde_data_control[sample((1:nrow(Lalonde_data_control)), replace=TRUE),]
Lalonde_randomForest_new_control = randomForest(x=Lalonde_data_new_control[,1:10],
y=Lalonde_data_new_control[,12])
treatment_effect_estimates[i] = mean(c((Lalonde_data_new_treated[,12] -
predict(Lalonde_randomForest_new_control, Lalonde_data_new_treated[,1:10])),
(predict(Lalonde_randomForest_new_treated, Lalonde_data_new_control[,1:10]) -
Lalonde_data_new_control[,12])))
setTxtProgressBar(progress_bar, i)
}
close(progress_bar)
hist(treatment_effect_estimates, main="Bootstrap Distribution of Estimates of Causal Estimand", xlab="Treatment Effect")
mean(treatment_effect_estimates)
sd(treatment_effect_estimates)
quantile(treatment_effect_estimates, prob=c(0.025,0.25,0.5,0.75,0.975))
[1] "The treatment effect estimate obtained from random forests is: 1625.08650762209"
☝ The estimates of the unit-level causal effects are based on the observed and imputed potential outcomes.
Lalonde_linear_model = lm(RE78 ~ MARR + NODEGREE + BLACK + HISPANIC + EDUC + AGE + RE74 + RE75 + U74 + U75 + TREAT,
data=Lalonde_data)
summary(Lalonde_linear_model)
#1.671e+03 + c(-1,1)*qt(0.975, df=433)*6.411e+02
Call:
lm(formula = RE78 ~ MARR + NODEGREE + BLACK + HISPANIC + EDUC + AGE + RE74 + RE75 + U74 + U75 + TREAT, data = Lalonde_data)

Residuals:
   Min     1Q Median     3Q    Max
 -9612  -4355  -1572   3054  53119

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.567e+02  3.522e+03   0.073  0.94193
MARR        -1.463e+02  8.823e+02  -0.166  0.86835
NODEGREE    -1.518e+01  1.006e+03  -0.015  0.98796
BLACK       -2.037e+03  1.174e+03  -1.736  0.08331 .
HISPANIC     4.258e+02  1.565e+03   0.272  0.78562
EDUC         4.008e+02  2.288e+02   1.751  0.08058 .
AGE          5.357e+01  4.581e+01   1.170  0.24284
RE74         1.234e-01  8.784e-02   1.405  0.16080
RE75         1.974e-02  1.503e-01   0.131  0.89554
U74          1.380e+03  1.188e+03   1.162  0.24590
U75         -1.071e+03  1.025e+03  -1.045  0.29651
TREAT        1.671e+03  6.411e+02   2.606  0.00948 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6517 on 433 degrees of freedom
Multiple R-squared:  0.05822, Adjusted R-squared:  0.0343
F-statistic: 2.433 on 11 and 433 DF,  p-value: 0.005974
An observational study involves an assignment mechanism whose functional form is unknown.
The analyses of observational studies are typically complicated by systematic differences in covariates across the different treatments.
Such differences can confound inferences and introduce bias.
Cochran (1965, 1968), Rubin (1973): Observational studies can be designed so as to reduce biases due to confounding, and yield valid causal inferences.
☝ Randomized experiments typically have the desirable feature that if sufficient care is taken in their design/randomization, then different statistical techniques will yield the correct answer, potentially with different levels of precision.
#install.packages("vioplot")
library(vioplot)
options(warn=-1)
experiment_data = Lalonde_data
numerical_covariate_balance = function(treated, control, covariate_name)
{
mean_t = mean(treated)
mean_c = mean(control)
var_t = var(treated)
var_c = var(control)
pooled_var = (var_t*(length(treated)-1) + var_c*(length(control)-1))/(length(treated)+length(control)-2)
standard_err = (pooled_var*(1/length(treated) + 1/length(control)))^0.5
degrees_of_freedom = length(treated) + length(control) - 2
test_statistic = (mean_t - mean_c)/(standard_err)
p_value = 2*min(pt(test_statistic, df=degrees_of_freedom, lower.tail=TRUE),
pt(test_statistic, df=degrees_of_freedom, lower.tail=FALSE))
vioplot(treated, control, names=c("Treatment", "Control"), col="cyan", horizontal=TRUE, side="right")
print(paste(covariate_name,":", mean_t, ",", mean_c, ",", p_value))
}
categorical_covariate_balance = function(treated, control, covariate_name)
{
p_value = prop.test(x=c(sum(treated),sum(control)),
n=c(length(treated), length(control)),
alternative="two.sided", correct=TRUE)$p.value
print(paste(covariate_name,":", mean(treated), ",", mean(control), ",", p_value))
}
categorical_covariate_balance(experiment_data$MARR[experiment_data$TREAT==1],
experiment_data$MARR[experiment_data$TREAT==0],
"Marriage")
categorical_covariate_balance(experiment_data$NODEGREE[experiment_data$TREAT==1],
experiment_data$NODEGREE[experiment_data$TREAT==0],
"No High School Degree")
categorical_covariate_balance(experiment_data$BLACK[experiment_data$TREAT==1],
experiment_data$BLACK[experiment_data$TREAT==0],
"African-American")
categorical_covariate_balance(experiment_data$HISPANIC[experiment_data$TREAT==1],
experiment_data$HISPANIC[experiment_data$TREAT==0],
"Hispanic")
categorical_covariate_balance(experiment_data$U74[experiment_data$TREAT==1],
experiment_data$U74[experiment_data$TREAT==0],
"Unemployed in 1974")
categorical_covariate_balance(experiment_data$U75[experiment_data$TREAT==1],
experiment_data$U75[experiment_data$TREAT==0],
"Unemployed in 1975")
numerical_covariate_balance(experiment_data$EDUC[experiment_data$TREAT==1],
experiment_data$EDUC[experiment_data$TREAT==0],
"Education")
numerical_covariate_balance(experiment_data$AGE[experiment_data$TREAT==1],
experiment_data$AGE[experiment_data$TREAT==0],
"Age")
numerical_covariate_balance(experiment_data$RE74[experiment_data$TREAT==1],
experiment_data$RE74[experiment_data$TREAT==0],
"1974 Income")
numerical_covariate_balance(experiment_data$RE75[experiment_data$TREAT==1],
experiment_data$RE75[experiment_data$TREAT==0],
"1975 Income")
[1] "Marriage : 0.189189189189189 , 0.153846153846154 , 0.393600016056372"
[1] "No High School Degree : 0.708108108108108 , 0.834615384615385 , 0.00214685437568014"
[1] "African-American : 0.843243243243243 , 0.826923076923077 , 0.744021068468005"
[1] "Hispanic : 0.0594594594594595 , 0.107692307692308 , 0.108868987268592"
[1] "Unemployed in 1974 : 0.708108108108108 , 0.75 , 0.381380629582063"
[1] "Unemployed in 1975 : 0.6 , 0.684615384615385 , 0.0813493975970139"
[1] "Education : 10.3459459459459 , 10.0884615384615 , 0.135411167169439"
[1] "Age : 25.8162162162162 , 25.0538461538462 , 0.264764268805686"
[1] "1974 Income : 2095.57405405405 , 2107.02538461538 , 0.982320764091842"
[1] "1975 Income : 1532.05675675676 , 1266.91 , 0.382253154580966"
Covariate | $\bar{X}_1$ | $\bar{X}_0$ | $p$-value |
---|---|---|---|
Marriage | $0.19$ | $0.15$ | $0.327$ |
No Degree | $0.71$ | $0.83$ | $0.0014$ |
Black | $0.84$ | $0.83$ | $0.649$ |
Hispanic | $0.06$ | $0.11$ | $0.076$ |
Years Education | $10.35$ | $10.09$ | $0.14$ |
Age | $25.82$ | $25.05$ | $0.265$ |
$1974$ Earnings | $2095.57$ | $2107.03$ | $0.98$ |
$1975$ Earnings | $1532.07$ | $1266.91$ | $0.382$ |
Unemployed $1974$ | $0.71$ | $0.75$ | $0.326$ |
Unemployed $1975$ | $0.60$ | $0.68$ | $0.065$ |
LaLonde (1986), and later Dehejia & Wahba (1999), considered what would happen if traditional statistical methodologies were used to study data from an observational study, not an experiment.
As we already know the "correct" answer for this job training program experiment, they constructed an observational study using just the treated units in this experiment, and investigated whether they would get the same "correct" answer from the observational study.
Specifically, suppose we had data on only the 185 treated units from the job training program, and did not know a great deal about how these units came to be treated.
To learn about the effect of the job training program on annual earnings, we form a control group by collecting data on 2490 units drawn from the Panel Study of Income Dynamics.
We collect the same covariates and response for all units.
🤔 Do we get the correct causal inferences for the effect of the job training program on annual earnings from this observational study?
observational_data = read.table("Lalonde_observational_data.txt", header=T, sep="")
observational_data_treated = observational_data[observational_data$TREAT==1,]
observational_data_control = observational_data[observational_data$TREAT==0,]
observational_randomForest_treated = randomForest(x=observational_data_treated[,3:12],
y=observational_data_treated[,1])
observational_randomForest_control = randomForest(x=observational_data_control[,3:12],
y=observational_data_control[,1])
treatment_effect_estimate = mean(c((observational_data_treated[,1] -
predict(observational_randomForest_control, observational_data_treated[,3:12])),
(predict(observational_randomForest_treated, observational_data_control[,3:12]) -
observational_data_control[,1])))
print(paste("The treatment effect estimate obtained from random forests is:", treatment_effect_estimate))
number_of_bootstrap_draws = 10^3
treatment_effect_estimates = rep(NA, number_of_bootstrap_draws)
observational_data_treated = observational_data[observational_data$TREAT==1,]
observational_data_control = observational_data[observational_data$TREAT==0,]
progress_bar = txtProgressBar(min=1, max=number_of_bootstrap_draws, style = 3)
for(i in 1:number_of_bootstrap_draws)
{
observational_data_new_treated = observational_data_treated[sample((1:nrow(observational_data_treated)), replace=TRUE),]
observational_randomForest_new_treated = randomForest(x=observational_data_new_treated[,3:12],
y=observational_data_new_treated[,1])
observational_data_new_control = observational_data_control[sample((1:nrow(observational_data_control)), replace=TRUE),]
observational_randomForest_new_control = randomForest(x=observational_data_new_control[,3:12],
y=observational_data_new_control[,1])
treatment_effect_estimates[i] = mean(c((observational_data_new_treated[,1] -
predict(observational_randomForest_new_control,
observational_data_new_treated[,3:12])),
(predict(observational_randomForest_new_treated,
observational_data_new_control[,3:12]) -
observational_data_new_control[,1])))
setTxtProgressBar(progress_bar, i)
}
close(progress_bar)
hist(treatment_effect_estimates, main="Bootstrap Distribution of Estimates of Causal Estimand", xlab="Treatment Effect")
mean(treatment_effect_estimates)
sd(treatment_effect_estimates)
quantile(treatment_effect_estimates, prob=c(0.025,0.25,0.5,0.75,0.975))
[1] "The treatment effect estimate obtained from random forests is: -9614.72259267092"
Machine learning did not correct the bias introduced by imbalances in the covariates (Imbens & Rubin, 2015: p. 277)!
categorical_covariate_balance(observational_data$MARR[observational_data$TREAT==1],
observational_data$MARR[observational_data$TREAT==0],
"Marriage")
categorical_covariate_balance(observational_data$NODEGREE[observational_data$TREAT==1],
observational_data$NODEGREE[observational_data$TREAT==0],
"No High School Degree")
categorical_covariate_balance(observational_data$BLACK[observational_data$TREAT==1],
observational_data$BLACK[observational_data$TREAT==0],
"African-American")
categorical_covariate_balance(observational_data$HISPANIC[observational_data$TREAT==1],
observational_data$HISPANIC[observational_data$TREAT==0],
"Hispanic")
categorical_covariate_balance(observational_data$U74[observational_data$TREAT==1],
observational_data$U74[observational_data$TREAT==0],
"Unemployed in 1974")
categorical_covariate_balance(observational_data$U75[observational_data$TREAT==1],
observational_data$U75[observational_data$TREAT==0],
"Unemployed in 1975")
numerical_covariate_balance(observational_data$EDUC[observational_data$TREAT==1],
observational_data$EDUC[observational_data$TREAT==0],
"Education")
numerical_covariate_balance(observational_data$AGE[observational_data$TREAT==1],
observational_data$AGE[observational_data$TREAT==0],
"Age")
numerical_covariate_balance(observational_data$RE74[observational_data$TREAT==1],
observational_data$RE74[observational_data$TREAT==0],
"1974 Income")
numerical_covariate_balance(observational_data$RE75[observational_data$TREAT==1],
observational_data$RE75[observational_data$TREAT==0],
"1975 Income")
[1] "Marriage : 0.189189189189189 , 0.866265060240964 , 4.68007728305411e-117"
[1] "No High School Degree : 0.708108108108108 , 0.305220883534137 , 8.32164017836424e-29"
[1] "African-American : 0.843243243243243 , 0.250602409638554 , 5.12399430234964e-65"
[1] "Hispanic : 0.0594594594594595 , 0.0325301204819277 , 0.0836134327225996"
[1] "Unemployed in 1974 : 0.708108108108108 , 0.0863453815261044 , 2.22130836324017e-129"
[1] "Unemployed in 1975 : 0.6 , 0.1 , 1.91495775994449e-81"
[1] "Education : 10.3459459459459 , 12.1168674698795 , 2.00780829375752e-14"
[1] "Age : 25.8162162162162 , 34.8506024096386 , 3.09494054870162e-30"
[1] "1974 Income : 2095.57405405405 , 19428.7458428916 , 5.60415114162859e-65"
[1] "1975 Income : 1532.05675675676 , 19063.3375870281 , 5.44370824933261e-65"
Covariate | $\bar{X}_1$ | $\bar{X}_0$ | $p$-value |
---|---|---|---|
Marriage | $0.19$ | $0.87$ | $\approx 0$ |
No Degree | $0.71$ | $0.31$ | $\approx 0$ |
Black | $0.84$ | $0.25$ | $\approx 0$ |
Hispanic | $0.06$ | $0.03$ | $0.08$ |
Years Education | $10.35$ | $12.1$ | $\approx 0$ |
Age | $25.82$ | $34.85$ | $\approx 0$ |
$1974$ Earnings | $2095.57$ | $19429$ | $\approx 0$ |
$1975$ Earnings | $1532.06$ | $19063$ | $\approx 0$ |
Unemployed $1974$ | $0.71$ | $0.09$ | $\approx 0$ |
Unemployed $1975$ | $0.60$ | $0.1$ | $\approx 0$ |
In general, treated and control units in an observational study will differ (sometimes immensely) in terms of their covariates.
Bias in estimation of the treatment effect will then be introduced by imbalances in covariates.
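To see this bias concretely, consider a simple illustration (the linear model and constant additive effect here are illustrative assumptions, not part of the analysis of these data): if $Y_i(1) = Y_i(0) + \tau$ and $\mathbb{E} \{ Y_i(0) \mid X_i \} = \alpha + \beta^\top X_i$, then the raw difference in observed group means satisfies
$$ \mathbb{E} \left ( \bar{Y}_t^{\mathrm{obs}} - \bar{Y}_c^{\mathrm{obs}} \right ) = \tau + \beta^\top \left ( \bar{X}_1 - \bar{X}_0 \right ), $$
so the bias is driven exactly by the covariate imbalance $\bar{X}_1 - \bar{X}_0$.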
LaLonde (1986): Traditional analyses of observational studies are fundamentally flawed.
Intuitively, we need to discard control units who are not comparable to treated units, so as to compare "like with like" and remove such biases.
🤔 How do we determine the units to discard in the presence of many covariates?
Just as in designed experiments, the propensity score plays a central role in the design, and ultimately the analysis, of an observational study.
The propensity score for a unit $i$ is defined as (excluding technicalities at this point) $$ e(X_i) = \mathrm{Pr}(W_i = 1 \mid X_i). $$
💡 An estimated propensity score serves as a one-dimensional summary of all covariates.
Within any subpopulation of units sharing the same propensity score value, the covariates are identically distributed across the treatment and control groups (Imbens & Rubin, 2015: p. 266, 277).
Discarding units with extreme propensity scores, and further subclassifying/matching the remaining units with respect to their propensity scores, can reduce the bias in estimating causal effects from observational studies (Imbens & Rubin, 2015: Parts III, IV).
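These points rest on the balancing property of the propensity score, derived formally later in this presentation: conditional on the propensity score, treatment assignment is independent of the covariates,
$$ X_i \perp W_i \mid e(X_i). $$
Hence units with (approximately) equal propensity scores can be compared as if their covariates had been balanced by randomization.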
Consider estimation of propensity scores based on a logistic regression with main effects for all covariates.
Control units with estimated propensity scores lower than the minimum of the treated units' estimated propensity scores, or larger than the maximum of the treated units' estimated propensity scores, are discarded.
These units are irrelevant to any analysis we wish to perform: we want to infer the treatment effect for units who resemble those in the treatment group, and we do not intend to extrapolate.
After discarding these units, 1208 control units remain.
propensity_score_model = glm(TREAT ~ MARR + NODEGREE + BLACK + HISPANIC + EDUC + AGE + RE74 + RE75 + U74 + U75,
data=observational_data, family=binomial)
propensity_scores = propensity_score_model$fitted.values
hist(propensity_scores[1:185], main="Estimated Propensity Scores for Treated Units", xlab="Propensity Score")
hist(propensity_scores[186:2675], main="Estimated Propensity Scores for Control Units", xlab="Propensity Score")
min_treat_prop = min(propensity_scores[1:185])
max_treat_prop = max(propensity_scores[1:185])
data = cbind(observational_data, propensity_scores)
treated_units = data[1:185,]
new_control_units = data[data$TREAT==0 & data$propensity_scores>=min_treat_prop & data$propensity_scores<=max_treat_prop,]
new_data = rbind(treated_units, new_control_units)
hist(propensity_scores[1:185], main="Estimated Propensity Scores for Treated Units", xlab="Propensity Score")
hist(new_control_units$propensity_scores, main="Estimated Propensity Scores for Remaining Control Units", xlab="Propensity Score")
categorical_covariate_balance(new_data$MARR[new_data$TREAT==1],
new_data$MARR[new_data$TREAT==0],
"Marriage")
categorical_covariate_balance(new_data$NODEGREE[new_data$TREAT==1],
new_data$NODEGREE[new_data$TREAT==0],
"No High School Degree")
categorical_covariate_balance(new_data$BLACK[new_data$TREAT==1],
new_data$BLACK[new_data$TREAT==0],
"African-American")
categorical_covariate_balance(new_data$HISPANIC[new_data$TREAT==1],
new_data$HISPANIC[new_data$TREAT==0],
"Hispanic")
categorical_covariate_balance(new_data$U74[new_data$TREAT==1],
new_data$U74[new_data$TREAT==0],
"Unemployed in 1974")
categorical_covariate_balance(new_data$U75[new_data$TREAT==1],
new_data$U75[new_data$TREAT==0],
"Unemployed in 1975")
numerical_covariate_balance(new_data$EDUC[new_data$TREAT==1],
new_data$EDUC[new_data$TREAT==0],
"Education")
numerical_covariate_balance(new_data$AGE[new_data$TREAT==1],
new_data$AGE[new_data$TREAT==0],
"Age")
numerical_covariate_balance(new_data$RE74[new_data$TREAT==1],
new_data$RE74[new_data$TREAT==0],
"1974 Income")
numerical_covariate_balance(new_data$RE75[new_data$TREAT==1],
new_data$RE75[new_data$TREAT==0],
"1975 Income")
[1] "Marriage : 0.189189189189189 , 0.780629139072848 , 1.09554668599274e-59"
[1] "No High School Degree : 0.708108108108108 , 0.410596026490066 , 6.62441037350883e-14"
[1] "African-American : 0.843243243243243 , 0.43294701986755 , 5.84073058508118e-25"
[1] "Hispanic : 0.0594594594594595 , 0.048841059602649 , 0.66360489445241"
[1] "Unemployed in 1974 : 0.708108108108108 , 0.165562913907285 , 5.45838970175597e-58"
[1] "Unemployed in 1975 : 0.6 , 0.204470198675497 , 5.61158183055686e-30"
[1] "Education : 10.3459459459459 , 11.2557947019868 , 7.14192277404495e-05"
[1] "Age : 25.8162162162162 , 31.8998344370861 , 2.60148220574681e-14"
[1] "1974 Income : 2095.57405405405 , 11051.4292944536 , 2.05524196773978e-44"
[1] "1975 Income : 1532.05675675676 , 9359.71751937086 , 6.39731949833502e-48"
Covariate | $\bar{X}_1$ | $\bar{X}_0$ | $p$-value |
---|---|---|---|
Marriage | $0.19$ | $0.78$ | $\approx 0$ |
No Degree | $0.71$ | $0.41$ | $\approx 0$ |
Black | $0.84$ | $0.43$ | $\approx 0$ |
Hispanic | $0.06$ | $0.05$ | $0.66$ |
Years Education | $10.35$ | $11.3$ | $\approx 0$ |
Age | $25.82$ | $31.9$ | $\approx 0$ |
$1974$ Earnings | $2095.57$ | $11051$ | $\approx 0$ |
$1975$ Earnings | $1532.06$ | $9360$ | $\approx 0$ |
Unemployed $1974$ | $0.71$ | $0.17$ | $\approx 0$ |
Unemployed $1975$ | $0.60$ | $0.2$ | $\approx 0$ |
observational_data_treated = new_data[new_data$TREAT==1,]
observational_data_control = new_data[new_data$TREAT==0,]
observational_randomForest_treated = randomForest(x=observational_data_treated[,3:12],
y=observational_data_treated[,1])
observational_randomForest_control = randomForest(x=observational_data_control[,3:12],
y=observational_data_control[,1])
treatment_effect_estimate = mean(c((observational_data_treated[,1] -
predict(observational_randomForest_control, observational_data_treated[,3:12])),
(predict(observational_randomForest_treated, observational_data_control[,3:12]) -
observational_data_control[,1])))
print(paste("The treatment effect estimate obtained from random forests is:", treatment_effect_estimate))
number_of_bootstrap_draws = 10^3
treatment_effect_estimates = rep(NA, number_of_bootstrap_draws)
progress_bar = txtProgressBar(min=1, max=number_of_bootstrap_draws, style = 3)
for(i in 1:number_of_bootstrap_draws)
{
observational_data_new_treated = observational_data_treated[sample((1:nrow(observational_data_treated)), replace=TRUE),]
observational_randomForest_new_treated = randomForest(x=observational_data_new_treated[,3:12],
y=observational_data_new_treated[,1])
observational_data_new_control = observational_data_control[sample((1:nrow(observational_data_control)), replace=TRUE),]
observational_randomForest_new_control = randomForest(x=observational_data_new_control[,3:12],
y=observational_data_new_control[,1])
treatment_effect_estimates[i] = mean(c((observational_data_new_treated[,1] -
predict(observational_randomForest_new_control,
observational_data_new_treated[,3:12])),
(predict(observational_randomForest_new_treated,
observational_data_new_control[,3:12]) -
observational_data_new_control[,1])))
setTxtProgressBar(progress_bar, i)
}
close(progress_bar)
hist(treatment_effect_estimates, main="Bootstrap Distribution of Estimates of Causal Estimand", xlab="Treatment Effect")
mean(treatment_effect_estimates)
sd(treatment_effect_estimates)
quantile(treatment_effect_estimates, prob=c(0.025,0.25,0.5,0.75,0.975))
[1] "The treatment effect estimate obtained from random forests is: -3761.77835337285"
Estimation of treatment effects through propensity score subclassification consists of the following broad steps.
Form subclasses of units with similar propensity scores.
Check that balance has been achieved on covariates.
Estimate treatment effects individually for each subclass.
Calculate the weighted average of treatment effect estimates across subclasses to estimate the treatment effect.
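In symbols (with notation introduced here for this summary): if subclass $j \in \{1, \ldots, J\}$ contains $N_{t,j}$ treated units and yields the estimate $\hat{\tau}_j$, then weighting by the treated counts — as done in the code below via `weights = number_treated_units/sum(number_treated_units)` — targets the average effect on the treated:
$$ \hat{\tau} = \sum_{j=1}^{J} \lambda_j \hat{\tau}_j, \qquad \lambda_j = \frac{N_{t,j}}{\sum_{j'=1}^{J} N_{t,j'}}. $$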
For example, consider subclasses defined by the quantiles (0,0.6,0.8,0.9,0.95,1) of the estimated propensity scores.
The numbers of treated units in the five subclasses are: 6, 10, 43, 62, 64.
The numbers of control units in the five subclasses are: 830, 268, 96, 8, 6.
quantiles = quantile(new_data$propensity_scores, probs=c(0, 0.6, 0.8, 0.9, 0.95, 1))
subclass_1 = new_data[new_data$propensity_scores<=quantiles[2],]
subclass_2 = new_data[new_data$propensity_scores<=quantiles[3] & new_data$propensity_scores>quantiles[2],]
subclass_3 = new_data[new_data$propensity_scores<=quantiles[4] & new_data$propensity_scores>quantiles[3],]
subclass_4 = new_data[new_data$propensity_scores<=quantiles[5] & new_data$propensity_scores>quantiles[4],]
subclass_5 = new_data[new_data$propensity_scores>quantiles[5],]
number_treated_units = c(sum(subclass_1$TREAT==1),
sum(subclass_2$TREAT==1),
sum(subclass_3$TREAT==1),
sum(subclass_4$TREAT==1),
sum(subclass_5$TREAT==1))
number_control_units = c(sum(subclass_1$TREAT==0),
sum(subclass_2$TREAT==0),
sum(subclass_3$TREAT==0),
sum(subclass_4$TREAT==0),
sum(subclass_5$TREAT==0))
number_treated_units
number_control_units
subclass = rep(NA, nrow(new_data))
for(i in 1:nrow(new_data))
{
if(new_data$propensity_scores[i]<=quantiles[2]) subclass[i] = 1
if(new_data$propensity_scores[i]<=quantiles[3] & new_data$propensity_scores[i]>quantiles[2]) subclass[i] = 2
if(new_data$propensity_scores[i]<=quantiles[4] & new_data$propensity_scores[i]>quantiles[3]) subclass[i] = 3
if(new_data$propensity_scores[i]<=quantiles[5] & new_data$propensity_scores[i]>quantiles[4]) subclass[i] = 4
if(new_data$propensity_scores[i]>quantiles[5]) subclass[i] = 5
}
subclassified_data = cbind(new_data, subclass)
covariate_balance = function(subclass)
{
# Compare treated and control means for each covariate within a subclass.
covariate_names = c("MARR", "NODEGREE", "BLACK", "HISPANIC", "EDUC",
"AGE", "RE74", "RE75", "U74", "U75")
covariate_labels = c("Marriage", "No High School Degree", "African-American",
"Hispanic", "Education", "Age", "1974 Income",
"1975 Income", "Unemployed in 1974", "Unemployed in 1975")
subclass_t = sapply(covariate_names, function(v) mean(subclass[[v]][subclass$TREAT==1]))
subclass_c = sapply(covariate_names, function(v) mean(subclass[[v]][subclass$TREAT==0]))
print(cbind(c("Covariate", covariate_labels),
c("Treated", unname(round(subclass_t, 5))),
c("Control", unname(round(subclass_c, 5)))))
}
covariate_balance(subclass_1)
covariate_balance(subclass_2)
covariate_balance(subclass_3)
covariate_balance(subclass_4)
covariate_balance(subclass_5)
weights = number_treated_units/(sum(number_treated_units))
      [,1]                    [,2]         [,3]
 [1,] "Covariate"             "Treated"    "Control"
 [2,] "Marriage"              "0.83333"    "0.85783"
 [3,] "No High School Degree" "0.5"        "0.35663"
 [4,] "African-American"      "0.66667"    "0.35783"
 [5,] "Hispanic"              "0"          "0.03735"
 [6,] "Education"             "10.83333"   "11.44217"
 [7,] "Age"                   "30.83333"   "32.56265"
 [8,] "1974 Income"           "13069.45"   "13211.0693"
 [9,] "1975 Income"           "12866.6"    "11656.6609"
[10,] "Unemployed in 1974"    "0"          "0.08554"
[11,] "Unemployed in 1975"    "0"          "0.13373"
      [,1]                    [,2]         [,3]
 [1,] "Covariate"             "Treated"    "Control"
 [2,] "Marriage"              "0.6"        "0.70522"
 [3,] "No High School Degree" "0.3"        "0.48134"
 [4,] "African-American"      "0.4"        "0.51119"
 [5,] "Hispanic"              "0.2"        "0.06716"
 [6,] "Education"             "11.7"       "11.10075"
 [7,] "Age"                   "29.7"       "30.64925"
 [8,] "1974 Income"           "5823.02"    "7364.91315"
 [9,] "1975 Income"           "3277.96"    "5045.04885"
[10,] "Unemployed in 1974"    "0.3"        "0.29851"
[11,] "Unemployed in 1975"    "0.4"        "0.33582"
      [,1]                    [,2]         [,3]
 [1,] "Covariate"             "Treated"    "Control"
 [2,] "Marriage"              "0.25581"    "0.4375"
 [3,] "No High School Degree" "0.60465"    "0.64583"
 [4,] "African-American"      "0.76744"    "0.80208"
 [5,] "Hispanic"              "0.04651"    "0.09375"
 [6,] "Education"             "10.30233"   "10.15625"
 [7,] "Age"                   "28.37209"   "31.125"
 [8,] "1974 Income"           "4429.32791" "3979.45456"
 [9,] "1975 Income"           "2401.91628" "2690.48185"
[10,] "Unemployed in 1974"    "0.37209"    "0.40625"
[11,] "Unemployed in 1975"    "0.32558"    "0.41667"
      [,1]                    [,2]         [,3]
 [1,] "Covariate"             "Treated"    "Control"
 [2,] "Marriage"              "0.19355"    "0"
 [3,] "No High School Degree" "0.67742"    "0.5"
 [4,] "African-American"      "0.85484"    "0.875"
 [5,] "Hispanic"              "0.08065"    "0"
 [6,] "Education"             "10.69355"   "10.75"
 [7,] "Age"                   "26.8871"    "21.25"
 [8,] "1974 Income"           "976.9871"   "3639.33808"
 [9,] "1975 Income"           "806.92581"  "2523.0121"
[10,] "Unemployed in 1974"    "0.77419"    "0.5"
[11,] "Unemployed in 1975"    "0.66129"    "0.25"
      [,1]                    [,2]         [,3]
 [1,] "Covariate"             "Treated"    "Control"
 [2,] "Marriage"              "0.01562"    "0"
 [3,] "No High School Degree" "0.89062"    "0.83333"
 [4,] "African-American"      "0.96875"    "0.83333"
 [5,] "Hispanic"              "0.03125"    "0.16667"
 [6,] "Education"             "9.78125"    "10.66667"
 [7,] "Age"                   "21.98438"   "22.66667"
 [8,] "1974 Income"           "0"          "0"
 [9,] "1975 Income"           "314.67969"  "161.12903"
[10,] "Unemployed in 1974"    "1"          "1"
[11,] "Unemployed in 1975"    "0.8125"     "0.66667"
options(warn=-1)
number_of_bootstrap_draws = 10^3
# Bootstrap the random forest estimate of the treatment effect within one subclass.
bootstrap_subclass_estimates = function(subclass_data, number_of_bootstrap_draws)
{
estimates = rep(NA, number_of_bootstrap_draws)
subclass_treated = subclass_data[subclass_data$TREAT==1,]
subclass_control = subclass_data[subclass_data$TREAT==0,]
progress_bar = txtProgressBar(min=1, max=number_of_bootstrap_draws, style = 3)
for(i in 1:number_of_bootstrap_draws)
{
new_treated = subclass_treated[sample((1:nrow(subclass_treated)), replace=TRUE),]
randomForest_new_treated = randomForest(x=new_treated[,3:12], y=new_treated[,1])
new_control = subclass_control[sample((1:nrow(subclass_control)), replace=TRUE),]
randomForest_new_control = randomForest(x=new_control[,3:12], y=new_control[,1])
estimates[i] = mean(c((new_treated[,1] -
predict(randomForest_new_control, new_treated[,3:12])),
(predict(randomForest_new_treated, new_control[,3:12]) -
new_control[,1])))
setTxtProgressBar(progress_bar, i)
}
close(progress_bar)
estimates
}
treatment_effect_estimates_1 = bootstrap_subclass_estimates(subclass_1, number_of_bootstrap_draws)
treatment_effect_estimates_2 = bootstrap_subclass_estimates(subclass_2, number_of_bootstrap_draws)
treatment_effect_estimates_3 = bootstrap_subclass_estimates(subclass_3, number_of_bootstrap_draws)
treatment_effect_estimates_4 = bootstrap_subclass_estimates(subclass_4, number_of_bootstrap_draws)
treatment_effect_estimates_5 = bootstrap_subclass_estimates(subclass_5, number_of_bootstrap_draws)
overall_treatment_effect_estimates = weights%*%rbind(t(treatment_effect_estimates_1),
t(treatment_effect_estimates_2),
t(treatment_effect_estimates_3),
t(treatment_effect_estimates_4),
t(treatment_effect_estimates_5))
hist(overall_treatment_effect_estimates, main="Bootstrap Distribution of Estimates of Causal Estimand", xlab="Treatment Effect")
quantile(overall_treatment_effect_estimates, prob=c(0.025,0.25,0.5,0.75,0.975))
There are four broad classes of strategies for causal inferences from designed observational studies (Imbens & Rubin, 2015: p. 270 - 276).
Combinations of these strategies can also be implemented in practice.
Subclassification with covariate adjustment within subclasses, and matching with covariate adjustment, are particularly attractive methods.
"Experiments should be analyzed as experiments, not observational studies" (Freedman, 2006: p. 691).
Implication: Analyze experiments as you designed them, in particular, by means of randomization-based inferences. Don't analyze experiments by regression models (and, by extension, machine learning algorithms) typically used in the analyses of observational studies.
Counterargument:
Experiments have unconfounded assignment mechanisms.
$\Rightarrow$ Potential outcomes in experiments are missing at random.
$\Rightarrow$ Imputation methods based on models can be used to impute missing potential outcomes and perform inferences on finite-population causal estimands (assuming the models are appropriate).
"... randomization does not justify the assumptions behind the ols [ordinary least squares] model" (Freedman, 2008: p. 181).
Counterargument:
Randomization corresponds to an unconfounded assignment mechanism, which justifies standard types of imputation methods.
If a model does not capture the relationships between the potential outcomes, treatments, and covariates, specify a better model.
Recall the three assumptions for a regular assignment mechanism.
There exists a function $q: \mathbb{R}^{K+2} \rightarrow (0,1)$ such that for all subjects $i$, $$ p_i(\mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1)) = q(X_i, Y_i(0), Y_i(1)) $$ and $$\mathrm{Pr} \{ \mathbf{W} \mid \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1) \} = c \displaystyle \prod_{i=1}^N q(X_i, Y_i(0), Y_i(1))^{W_i}\{1-q(X_i,Y_i(0),Y_i(1))\}^{1-W_i} $$ for some set of $(\mathbf{W}, \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1))$, where $c$ is the normalization constant for the probability mass function of the treatment assignment mechanism.
For all subjects $i$, $0 < p_i(\mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1)) < 1$.
For any $\mathbf{w} \in \{0, 1\}^N$ and potential outcome vectors $\mathbf{Y}(0), \mathbf{Y}'(0), \mathbf{Y}(1), \mathbf{Y}'(1) \in \mathbb{R}^N$, $$ \mathrm{Pr} \{ \mathbf{W} = \mathbf{w} \mid \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1) \} = \mathrm{Pr} \{ \mathbf{W} = \mathbf{w} \mid \mathbf{X}, \mathbf{Y}'(0), \mathbf{Y}'(1) \}. $$
☝ Outside of sequential experiments (such as multi-armed bandits), the individualistic assignment assumption is not particularly controversial.
☝ If an experimental unit receives treatment with either probability zero or probability one, then estimates of the treatment effect for similar such experimental units will necessarily involve extrapolation, and so such units should not be considered for causal inferences. The probabilistic assignment mechanism is thus intuitive and justifiable.
☝ The unconfoundedness assumption is perhaps the most controversial assumption for causal inference on observational studies under the Rubin Causal Model. Having said that, it is commonly invoked across a wide range of domains (Imbens & Rubin, 2015: p. 262 - 263).
These regularity assumptions do not merely enable causal inferences for randomized experiments. They can also enable valid causal inferences for observational studies.
💡 If an (unknown) assignment mechanism is regular, then for that assignment mechanism we have that
a completely randomized design was effectively conducted for subpopulations of experimental units with the same covariates, and that
a causal interpretation can be attached to the comparison of observed outcomes for treated and control units within the subpopulations.
The second implication holds because the observed outcomes within a particular subpopulation constitute a random sample of the potential outcomes for that subpopulation, so that the difference in observed averages is unbiased for the subpopulation average treatment effect.
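In symbols, for the subpopulation with $X_i = x$ (writing $\tau(x)$ for its average treatment effect, notation introduced here), unconfoundedness gives
\begin{align} \mathbb{E} \left \{ Y_i^{\mathrm{obs}} \mid W_i = 1, X_i = x \right \} - \mathbb{E} \left \{ Y_i^{\mathrm{obs}} \mid W_i = 0, X_i = x \right \} &= \mathbb{E} \left \{ Y_i(1) \mid X_i = x \right \} - \mathbb{E} \left \{ Y_i(0) \mid X_i = x \right \} \\ &= \tau(x). \end{align}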
💡 The fact that the assignment mechanism is unknown does not matter for this result.
These desirable implications of a regular assignment mechanism can be operationalized for deriving valid, unbiased causal inferences from observational studies that have regular assignment mechanisms by means of the propensity scores $$ e(X_i) = \mathrm{Pr}(W_i = 1 \mid X_i). $$
Claim : For a regular assignment mechanism, $\mathrm{Pr} \left \{ W_i = 1 \mid X_i, e(X_i) \right \} = \mathrm{Pr} \left \{ W_i = 1 \mid e(X_i) \right \}$.
Proof : As $e(X_i)$ is a function of $X_i$, $\mathrm{Pr} \left \{ W_i = 1 \mid X_i, e(X_i) \right \} = \mathrm{Pr} \left \{ W_i = 1 \mid X_i \right \} = e(X_i)$. Also, by Adam's law (the law of iterated expectations),
\begin{align} \mathrm{Pr} \left \{ W_i = 1 \mid e(X_i) \right \} &= \mathbb{E} \left \{ W_i \mid e(X_i) \right \} \\ &= \mathbb{E} \left \{ \mathbb{E} \left \{ W_i \mid X_i, e(X_i) \right \} \mid e(X_i) \right \} \\ &= \mathbb{E} \left \{ e(X_i) \mid e(X_i) \right \} \\ &= e(X_i). \end{align}We thus immediately have the result. ∎
💡 An estimated propensity score serves as a one-dimensional summary of all covariates, such that experimental units with the same propensity score values have similar covariates.
💡 If many covariates are considered for the design and analysis of an observational study (e.g., so as to ensure unconfoundedness), it can be sufficient to design the study based on the propensity score so as to yield balance on the covariates involved in the propensity score.
Claim : For an unconfounded treatment assignment mechanism, the treatment assignment remains unconfounded given the propensity score, i.e., $$ \mathrm{Pr} \left \{ W_i = 1 \mid Y_i(0), Y_i(1), e(X_i) \right \} = \mathrm{Pr} \left \{ W_i = 1 \mid e(X_i) \right \} = e(X_i). $$
Proof :
\begin{align} \mathrm{Pr} \left \{ W_i = 1 \mid Y_i(0), Y_i(1), e(X_i) \right \} &= \mathbb{E} \left \{ W_i \mid Y_i(0), Y_i(1), e(X_i) \right \} \\ &= \mathbb{E} \left \{ \mathbb{E} \left \{ W_i \mid Y_i(0), Y_i(1), X_i, e(X_i) \right \} \mid Y_i(0), Y_i(1), e(X_i) \right \} \\ &= \mathbb{E} \left \{ \mathbb{E} \left \{ W_i \mid X_i, e(X_i) \right \} \mid Y_i(0), Y_i(1), e(X_i) \right \} \\ &= e(X_i), \end{align}where the third equality follows from unconfoundedness. ∎
💡 Given a vector of covariates that ensures unconfoundedness, adjusting for treatment-control differences in propensity scores suffices to remove the biases associated with differences in the covariates.
☝ This situation is analogous to that of a completely randomized design.
Claim : If $b(X_i)$ is a function of the covariates such that $\mathrm{Pr} \left \{ W_i = 1 \mid X_i, b(X_i) \right \} = \mathrm{Pr} \left \{ W_i = 1 \mid b(X_i) \right \}$ (i.e., $b(X_i)$ is a balancing score), then $e(X_i) = g(b(X_i))$ for some function $g$.
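A short proof of this claim, using the same conditioning arguments as above: since $e(X_i)$ is a function of $X_i$, $\mathrm{Pr} \left \{ W_i = 1 \mid X_i, b(X_i) \right \} = \mathrm{Pr} \left \{ W_i = 1 \mid X_i \right \} = e(X_i)$, and since $b(X_i)$ is a balancing score,
$$ e(X_i) = \mathrm{Pr} \left \{ W_i = 1 \mid X_i, b(X_i) \right \} = \mathrm{Pr} \left \{ W_i = 1 \mid b(X_i) \right \}. $$
The right-hand side is a function of $b(X_i)$ alone, so $e(X_i) = g(b(X_i))$ with $g(b) = \mathrm{Pr} \left \{ W_i = 1 \mid b(X_i) = b \right \}$. ∎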
💡 Among all balancing scores, the propensity score is the coarsest: it provides the greatest reduction in the number of variables that must be adjusted for.
The unconfoundedness assumption is critical to the utility of regular assignment mechanisms.
For an assignment mechanism to be unconfounded, we must have a sufficiently rich set of covariates such that adjusting for differences in their observed values between the treatment and control groups will remove systematic biases in the causal inferences.
The unconfoundedness assumption is typically the most controversial assumption for causal inferences on observational studies under the Rubin Causal Model.
Furthermore, this assumption is not testable: the data are not directly informative about the distribution of $Y_i(0)$ for those units $i$ given treatment, nor are they directly informative about the distribution of $Y_j(1)$ for those units $j$ given control.
Our inability to test unconfoundedness in general is similar to our inability to test whether a missing data mechanism is missing not at random (MNAR) or missing at random (MAR).
As unconfoundedness is such an important and controversial assumption, supplementary analyses that can assess its plausibility can be important for drawing causal conclusions.
The superpopulation perspective can be helpful for evaluating the frequentist properties of causal inference methods for observational studies with regular assignment mechanisms.
☝ In the previous derivations, covariates $X_i$ were considered to have been randomly drawn from a distribution.
The superpopulation perspective on unconfoundedness is also helpful for clarifying how the distributions of missing and observed potential outcomes are related.
💡 Under unconfoundedness, we have for all $i = 1, \ldots, N$ that $$ \left ( Y_i^{\mathrm{mis}} \mid W_i = w, X_i \right ) \sim \left ( Y_i^{\mathrm{obs}} \mid W_i = 1 - w, X_i \right), $$ so that, since $Y_i^{\mathrm{mis}} = (1-W_i)Y_i(1) + W_iY_i(0)$ and $Y_i^{\mathrm{obs}} = W_iY_i(1) + (1-W_i)Y_i(0)$, the distribution of a missing potential outcome equals the distribution of an observable outcome.
This is important because the data are not directly informative about the distribution of $Y_i^{\mathrm{mis}}$ unless we have an assumption such as unconfoundedness that connects this distribution to distributions that can be inferred from the data.
☝ Unconfoundedness is not a testable assumption. However, it does imply that one should compare "like with like", and thus it is well-motivated.
Although unconfoundedness is not testable, there do exist analyses that can be done to assess its validity.
If unconfoundedness is valid for an observational study, then biases in the causal inferences can be removed by adjusting for differences in covariates.
Such adjustments can be difficult to perform if there are a large number of covariates.
The propensity score is effectively a low-dimensional summary of covariates that is sufficient for removing biases in causal inferences from observational studies in which treatment and control groups differ in terms of covariates.
☝ In practice, it may not be necessary to get precise inferences on propensity scores. Instead, we use propensity score models as devices for designing observational studies with good covariate balance.
Covariate imbalances in observational studies can make causal inferences sensitive to changes in analysis methods (e.g., standard Neymanian inferences versus model-based inferences) and their specifications (e.g., covariates included in potential outcomes models).
In addition, covariate imbalances can make causal inferences imprecise.
Performing a design phase for an observational study can lead to robust and valid causal inferences for the sample of subjects included in the study.
For an observational study with a regular assignment mechanism, the estimated propensity scores can be useful for designing the observational study and obtaining valid, unbiased causal inferences.
The goal in estimating the propensity score is to design the observational study based on the estimates so as to obtain balanced covariate distributions between the treatment and control groups.
The goal in estimating the propensity score is *not* to obtain the best estimate of the true propensity scores.
Sometimes estimated propensity scores are better than true propensity scores in terms of yielding designed observational studies with good covariate balance.
Typically, we are not interested in interpreting the propensity scores, although assessing the strength and nature of the dependence of treatment assignment on the covariates could be helpful for assessing the unconfoundedness assumption.
The adequacy of a propensity score model specification can be assessed by means of the property that, for a regular assignment mechanism, $\mathrm{Pr} \left \{ W_i = 1 \mid X_i, e(X_i) \right \} = \mathrm{Pr} \left \{ W_i = 1 \mid e(X_i) \right \}$.
Specifically, using the estimates $\hat{e}(X_i)$ of the propensity scores, check whether the covariates are independent of the treatment indicator given the estimated propensity scores.
This assessment is done after constructing subclasses/strata/blocks of experimental units such that the units within a specific subclass have similar propensity scores (i.e., the propensity score estimates within a subclass have small variability).
If subclasses have small variability in the propensity score estimates, but poor covariate balance, then the propensity score model is most likely poorly specified.
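As a rough sketch of how this assessment works, consider simulated data with hypothetical covariates `x1` and `x2` (not the Lalonde data analyzed below): estimate propensity scores, form subclasses at the quintiles of the estimates, and compare within-subclass covariate means across treatment groups.

```r
# Sketch of propensity score subclassification on simulated data.
# (Hypothetical covariates x1, x2; not the Lalonde data used later.)
set.seed(1)
n  <- 2000
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
w  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 0.6 * x2))  # treatment indicator

# Estimate propensity scores via logistic regression.
ps_fit <- glm(w ~ x1 + x2, family = binomial)
ps_hat <- fitted(ps_fit)

# Form five subclasses at the quintiles of the estimated propensity scores.
subclass <- cut(ps_hat, breaks = quantile(ps_hat, probs = seq(0, 1, 0.2)),
                include.lowest = TRUE, labels = FALSE)

# Within each subclass, compare covariate means between treatment groups;
# small within-subclass differences suggest the model balances x1.
within_diff <- sapply(split(data.frame(x1, w), subclass), function(d)
  mean(d$x1[d$w == 1]) - mean(d$x1[d$w == 0]))
overall_diff <- mean(x1[w == 1]) - mean(x1[w == 0])
round(c(overall = overall_diff, within_diff), 3)
```

The unadjusted treatment-control difference in `x1` is large by construction; within subclasses with similar propensity score estimates, the differences shrink considerably.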
Trimming the experimental units based on the propensity score estimates can be helpful, but be aware of how trimming could be connected to the choice of estimand (e.g., trimming procedures can differ depending on whether the ATE or ATT causal estimand is of interest (Imbens & Rubin, 2015: p. 313)).
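A minimal sketch of one such trimming rule, discarding units whose propensity score estimates fall outside $[0.1, 0.9]$; the thresholds here are an illustrative rule of thumb, not a prescription, and estimand-specific procedures differ as noted above.

```r
# Illustrative trimming rule: keep units with estimated propensity
# scores inside [0.1, 0.9]. (Thresholds are an assumed rule of thumb.)
ps_hat <- c(0.02, 0.15, 0.48, 0.85, 0.95, 0.60)
keep   <- ps_hat >= 0.1 & ps_hat <= 0.9
ps_hat[keep]
```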
Imbens & Rubin (2015, p. 293 - 294) provide an automated procedure for propensity score subclassification based on propensity score estimates.
We are interested in comparing two multivariate covariate distributions: one for the treatment group, and the other for the control group.
In practice, we compare the centers and spreads/dispersions of the two distributions.
This can be done via a $t$-test, where the numerator consists of a weighted average of covariate averages across the subclasses, and the denominator consists of a weighted average of sample variances across the subclasses (Imbens & Rubin, 2015: p. 298).
The $t$-test in this context is actually meant to assess whether the differences between the two distributions are so great that biases will remain in causal inferences.
The $t$-test will inevitably reject the null hypothesis in large samples, even if the normalized difference stays the same. As such, it can be less relevant than the normalized difference (Imbens & Rubin, 2015: p. 311).
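The contrast between the two measures can be sketched on simulated covariate data; the mean shift of 0.1 and the sample sizes below are illustrative assumptions. The normalized difference scales the mean difference by the covariate spread and so stays stable as $n$ grows, while the $t$-statistic scales by the standard error and so grows without bound.

```r
# Normalized difference (Imbens & Rubin) vs. the two-sample t-statistic.
norm_diff <- function(xt, xc)
  (mean(xt) - mean(xc)) / sqrt((var(xt) + var(xc)) / 2)
t_stat <- function(xt, xc)
  (mean(xt) - mean(xc)) / sqrt(var(xt) / length(xt) + var(xc) / length(xc))

set.seed(2)
for (n in c(100, 10000)) {
  xt <- rnorm(n, mean = 0.1)  # treated-group covariate, slightly shifted
  xc <- rnorm(n, mean = 0.0)  # control-group covariate
  cat(sprintf("n = %5d: normalized diff = %5.3f, t-stat = %6.2f\n",
              n, norm_diff(xt, xc), t_stat(xt, xc)))
}
```

With the same underlying imbalance, the normalized difference is similar at both sample sizes, but the $t$-statistic at $n = 10{,}000$ is roughly ten times the one at $n = 100$.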
Similar tests can be performed for comparing the dispersions in the two covariate distributions (Imbens & Rubin, 2015: p. 312).
We can consider transformations of the covariates as well. The propensity score can be seen as one such transformation.
Visualizations can be extremely useful for assessing covariate balance.
Love plot.
Overlapping histograms for a single covariate across the treatment and control groups.
Overlapping empirical CDFs for a single covariate across the treatment and control groups.
QQ plots of t-test statistics. If the covariates are well-balanced, then the QQ plots should be flatter than a 45-degree line.
Histograms of t-test statistics. If the covariates are well-balanced, then the histogram should approximate the density of a t-distribution.
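A minimal Love plot sketch in base R. The standardized differences below are illustrative values loosely patterned on the before/after matching balance output shown later in this section, not computed results.

```r
# Love plot sketch: absolute standardized mean differences per covariate,
# before and after adjustment. (Values are illustrative assumptions.)
covs   <- c("EDUC", "AGE", "RE74", "RE75", "BLACK", "MARR")
before <- c(0.88, 1.26, 1.54, 1.44, 1.63, 1.72)  # unadjusted
after  <- c(0.06, 0.12, 0.06, 0.07, 0.01, 0.15)  # after matching

dotchart(before, labels = covs, xlim = c(0, 2), pch = 19,
         xlab = "Absolute standardized mean difference")
points(after, seq_along(covs), pch = 1)
abline(v = 0.1, lty = 2)  # a common rule-of-thumb threshold
legend("bottomright", pch = c(19, 1), legend = c("Before", "After"))
```

Each covariate gets one row; the gap between the filled and open points shows how much the adjustment improved balance.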
If the covariate distributions for the treatment and control groups are similar, then one can be less concerned about the sensitivity of the estimates, and one can have some assurance that bias has been removed from the analyses of the observational study.
Imbens and Rubin (2015, p. 318 - 332) provide four case studies that demonstrate how covariate balance can be assessed in practice.
The manual identification of subclasses and matches can take some time and effort, but is ultimately useful for better understanding the experimental units in the observational study, and fruitful for understanding the mechanics of subclassification and matching.
Consider the setting in which
there is a large pool of possible controls for the design of an observational study,
the set of treated units is fixed a priori, and
the estimand of interest is the average treatment effect for the treated (ATT).
In such a setting, matching methods can be implemented to effectively select a control sample that is well-balanced with respect to the treated experimental units.
In a matching procedure, one or more distinct controls are matched to each treated unit so that their covariates are all very similar to one another.
Matching experimental units can make the subsequent analyses more robust and credible.
# install.packages("Matching")  # uncomment to install, if needed
library(Matching)

# Lalonde data: treated units plus observational controls
# (the TYPE column distinguishes EXP from OBS records).
observational_data = read.table("Lalonde_observational_data.txt", header = TRUE, sep = "")

# Logistic regression model for the propensity scores, including
# selected two-way interactions among the covariates.
propensity_score_model = glm(TREAT ~ EDUC + AGE + RE74 + RE75 + U74 + U75 + BLACK + MARR + HISPANIC +
                               EDUC:AGE + RE75:MARR + AGE:BLACK + EDUC:U74 + U74:U75 + AGE:MARR +
                               AGE:RE75 + EDUC:RE75 + BLACK:MARR + AGE:U74 + AGE:U75,
                             data = observational_data,
                             family = "binomial")
propensity_score_estimates = propensity_score_model$fitted

observed_outcomes = observational_data$RE78  # earnings in 1978
treatment = observational_data$TREAT

# One-to-one matching (M = 1) on the estimated propensity scores.
matched_dataset = Match(Y = observed_outcomes, X = propensity_score_estimates, Tr = treatment, M = 1)
summary(matched_dataset)

# Stack the matched treated and control units into a single data set.
matched_dataset_matrix = rbind(observational_data[matched_dataset$index.treated, ],
                               observational_data[matched_dataset$index.control, ])
head(matched_dataset_matrix)
tail(matched_dataset_matrix)
Loading required package: MASS
##
##  Matching (Version 4.9-7, Build Date: 2020-02-05)
##  See http://sekhon.berkeley.edu/matching for additional documentation.
##  Please cite software as:
##   Jasjeet S. Sekhon. 2011. ``Multivariate and Propensity Score Matching
##   Software with Automated Balance Optimization: The Matching package for R.''
##   Journal of Statistical Software, 42(7): 1-52.
##
Estimate...  1558.2
AI SE......  1751
T-stat.....  0.88987
p.val......  0.37354

Original number of observations..............  2675
Original number of treated obs...............  185
Matched number of observations...............  185
Matched number of observations  (unweighted).  548
| | RE78 | TREAT | MARR | NODEGREE | BLACK | HISPANIC | EDUC | AGE | RE74 | RE75 | U74 | U75 | TYPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 9930.0 | 1 | 1 | 1 | 1 | 0 | 11 | 37 | 0 | 0 | 1 | 1 | EXP |
2 | 3595.9 | 1 | 0 | 1 | 0 | 1 | 9 | 22 | 0 | 0 | 1 | 1 | EXP |
3 | 24909.5 | 1 | 0 | 0 | 1 | 0 | 12 | 30 | 0 | 0 | 1 | 1 | EXP |
3.1 | 24909.5 | 1 | 0 | 0 | 1 | 0 | 12 | 30 | 0 | 0 | 1 | 1 | EXP |
4 | 7506.1 | 1 | 0 | 1 | 1 | 0 | 11 | 27 | 0 | 0 | 1 | 1 | EXP |
5 | 289.8 | 1 | 0 | 1 | 1 | 0 | 8 | 33 | 0 | 0 | 1 | 1 | EXP |
| | RE78 | TREAT | MARR | NODEGREE | BLACK | HISPANIC | EDUC | AGE | RE74 | RE75 | U74 | U75 | TYPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2473 | 17732.72 | 0 | 1 | 1 | 1 | 0 | 7 | 50 | 20384.21 | 17008.06 | 0 | 0 | OBS |
2489 | 17732.72 | 0 | 1 | 1 | 1 | 0 | 6 | 48 | 21551.94 | 16112.90 | 0 | 0 | OBS |
2495 | 31032.26 | 0 | 1 | 1 | 1 | 0 | 10 | 22 | 21551.94 | 25064.52 | 0 | 0 | OBS |
2547 | 29554.53 | 0 | 1 | 1 | 1 | 0 | 11 | 44 | 24294.91 | 35806.45 | 0 | 0 | OBS |
2548 | 29554.53 | 0 | 1 | 1 | 1 | 0 | 11 | 44 | 24294.91 | 35806.45 | 0 | 0 | OBS |
2607 | 45809.52 | 0 | 0 | 1 | 1 | 0 | 11 | 36 | 29389.00 | 25982.95 | 0 | 0 | OBS |
# Assess covariate balance before and after matching; nboots sets the
# number of bootstrap samples used for the KS tests.
matched_dataset_balance = MatchBalance(TREAT ~ EDUC + AGE + RE74 + RE75 + U74 + U75 + BLACK + MARR + HISPANIC +
                                         EDUC:AGE + RE75:MARR + AGE:BLACK + EDUC:U74 + U74:U75 + AGE:MARR +
                                         AGE:RE75 + EDUC:RE75 + BLACK:MARR + AGE:U74 + AGE:U75,
                                       data = observational_data,
                                       match.out = matched_dataset,
                                       nboots = 10)
***** (V1) EDUC *****
                       Before Matching        After Matching
mean treatment........        10.346                10.346
mean control..........        12.117                10.474
std mean diff.........       -88.077               -6.3509
mean raw eQQ diff.....        1.8595               0.91058
med  raw eQQ diff.....             2                     1
max  raw eQQ diff.....             5                     3
mean eCDF diff........        0.1091              0.056911
med  eCDF diff........       0.01944              0.037409
max  eCDF diff........       0.40289               0.32117
var ratio (Tr/Co).....       0.42549               0.77938
T-test p-value........    < 2.22e-16               0.50116
KS Bootstrap p-value..    < 2.22e-16            < 2.22e-16
KS Naive p-value......    < 2.22e-16            < 2.22e-16
KS Statistic..........       0.40289               0.32117

[Analogous balance tables for (V2) AGE through (V20) AGE:U75 appear in the full output. In each case the absolute standardized mean difference shrinks substantially after matching, and the after-matching t-test p-values are no longer significant, although the bootstrap KS tests for the continuous covariates remain significant.]

Before Matching Minimum p.value: < 2.22e-16
Variable Name(s): EDUC AGE RE74 RE75 U74 U75 BLACK MARR EDUC:AGE RE75:MARR AGE:BLACK EDUC:U74 U74:U75 AGE:MARR AGE:RE75 EDUC:RE75 AGE:U74 AGE:U75  Number(s): 1 2 3 4 5 6 7 8 10 11 12 13 14 15 16 17 19 20

After Matching Minimum p.value: < 2.22e-16
Variable Name(s): EDUC AGE RE74 RE75 EDUC:AGE RE75:MARR AGE:BLACK EDUC:U74 AGE:MARR AGE:RE75 EDUC:RE75 AGE:U74 AGE:U75  Number(s): 1 2 3 4 10 11 12 13 15 16 17 19 20
Cochran W.G. (1965). The planning of observational studies of human populations. Journal of the Royal Statistical Society: Series A, 128(2), 234-255.
Cochran W.G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, 24(2), 295-313.
Dehejia R.V. and Wahba S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association 94(448), 1053-1062.
Efron B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics 7(1), 1-26.
Freedman D.A. (2006). Statistical models for causation: What inferential leverage do they provide? Evaluation Review 30(6), 691-713.
Freedman D.A. (2008). On regression adjustments to experimental data. Advances in Applied Mathematics 40(2), 180-193.
Freedman D.A. (2009). Statistical Models: Theory and Practice, Cambridge University Press (2nd edition).
Garcia-Horton V. (2015). Topics in Bayesian Inference for Causal Effects. Harvard University Doctoral Dissertation.
Gelman A. and Pardoe I. (2007). Average predictive comparisons for models with nonlinearity, interactions, and variance components. Sociological Methodology, 37(1), 23-51.
Holland P.W. (1986). Statistics and causal inference. Journal of the American Statistical Association 81(396), 945-960.
Imbens G.W. and D.B. Rubin (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge University Press (1st edition).
Künzel S.R., Sekhon J.S., Bickel P.J., Yu B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences of the United States of America 116(10), 4156–4165.
LaLonde R.J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review 76(4), 604-620.
Little R.J.A. and Rubin D.B. (2002). Statistical Analysis With Missing Data. Wiley-Interscience (2nd edition).
Lu M., Sadiq S., Feaster D.J., Ishwaran H. (2018). Estimating individual treatment effect in observational data Using random forest methods. Journal of Computational and Graphical Statistics 27(1), 209-219.
Pearl J. (2009). Causality, Cambridge University Press (2nd edition).
Rubin D.B. (1973). Matching to remove bias in observational studies. Biometrics, 29(1), 159-183.